Evaluating Emotional Nuances in Dialogue Summarization
Automatic dialogue summarization is a well-established task that aims to
identify the most important content from human conversations to create a short
textual summary. Despite recent progress in the field, we show that most of the
research has focused on summarizing the factual information, leaving aside the
affective content, which can nevertheless convey useful information to analyse, monitor,
or support human interactions. In this paper, we propose and evaluate a set of
measures to quantify how much emotion is preserved in dialogue summaries.
Results show that state-of-the-art summarization models do not preserve
the emotional content well in their summaries. We also show that by reducing the
training set to emotional dialogues only, the emotional content is better
preserved in the generated summaries, while the most salient factual
information is retained.
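To illustrate the kind of measure the abstract describes, here is a minimal sketch of one plausible emotion-preservation score: it compares the emotion-label distributions of the source dialogue and its summary with cosine similarity. The `emotion_scores` stub, the label set, and the choice of cosine similarity are assumptions for illustration, not the measures proposed in the paper.

```python
# Hypothetical sketch of an emotion-preservation measure for dialogue summaries.
# `emotion_scores` is assumed to be any off-the-shelf emotion classifier that
# returns a probability per emotion label; it is NOT the metric from the paper.
from math import sqrt

EMOTIONS = ["anger", "fear", "joy", "sadness", "surprise", "neutral"]

def emotion_scores(text: str) -> dict[str, float]:
    """Placeholder: plug in any emotion classifier returning label probabilities."""
    raise NotImplementedError

def emotion_preservation(dialogue: str, summary: str) -> float:
    """Cosine similarity between the emotion distributions of dialogue and summary."""
    d = emotion_scores(dialogue)
    s = emotion_scores(summary)
    num = sum(d[e] * s[e] for e in EMOTIONS)
    den = sqrt(sum(d[e] ** 2 for e in EMOTIONS)) * sqrt(sum(s[e] ** 2 for e in EMOTIONS))
    return num / den if den else 0.0
```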
Cross-domain Voice Activity Detection with Self-Supervised Representations
Voice Activity Detection (VAD) aims at detecting speech segments in an audio
signal, a necessary first step for many of today's speech-based
applications. Current state-of-the-art methods focus on training a neural
network on features derived directly from the acoustics, such as Mel
Filter Banks (MFBs). Such methods therefore require an extra normalisation step
to adapt to a new domain where the acoustics are affected, which may simply be
due to a change of speaker, microphone, or environment. In addition, this
normalisation step is usually a rather rudimentary method with certain
limitations, such as being highly sensitive to the amount of data available
for the new domain. Here, we exploit the crowd-sourced Common Voice (CV)
corpus to show that representations based on Self-Supervised Learning (SSL)
adapt well to different domains, because they are contextualised
representations of speech learned across multiple domains. SSL representations also
achieve better results than systems based on hand-crafted representations
(MFBs) and off-the-shelf VADs, with a significant improvement in cross-domain
settings.
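As a rough illustration of the approach, the sketch below places a small frame-level VAD classifier on top of frozen SSL speech representations. The use of torchaudio's wav2vec 2.0 bundle, the file name, and the single linear head are assumptions for illustration, not the system described above.

```python
# A minimal sketch (not the paper's system): a frame-level VAD head on top of
# frozen SSL speech representations, here assuming torchaudio's wav2vec 2.0 bundle.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE            # pretrained SSL encoder
ssl_model = bundle.get_model().eval()                   # frozen feature extractor

class VADHead(torch.nn.Module):
    """Tiny classifier mapping each SSL frame to a speech/non-speech probability."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj = torch.nn.Linear(dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.proj(feats)).squeeze(-1)   # (batch, frames)

vad = VADHead()

waveform, sr = torchaudio.load("utterance.wav")          # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
with torch.no_grad():
    feats, _ = ssl_model.extract_features(waveform)       # list of layer outputs
speech_prob = vad(feats[-1])                              # per-frame speech probability
```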
Can GPT models Follow Human Summarization Guidelines? Evaluating ChatGPT and GPT-4 for Dialogue Summarization
This study explores the capabilities of prompt-driven Large Language Models
(LLMs) like ChatGPT and GPT-4 in adhering to human guidelines for dialogue
summarization. Experiments employed DialogSum (English social conversations)
and DECODA (French call center interactions), testing various prompts,
including prompts from existing literature, prompts derived from human summarization
guidelines, and a two-step prompt approach. Our findings indicate that
GPT models often produce lengthy summaries and deviate from human summarization
guidelines. However, using human guidelines as an intermediate step shows
promise, outperforming direct word-length constraint prompts in some cases. The
results reveal that GPT models exhibit unique stylistic tendencies in their
summaries. While BERTScores for GPT outputs did not decrease dramatically,
suggesting semantic similarity to human references and to specialised pre-trained
models, ROUGE scores reveal grammatical and lexical disparities between
GPT-generated and human-written summaries. These findings shed light on the
capabilities and limitations of GPT models in following human instructions for
dialogue summarization.
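As an illustration of what a two-step prompting pipeline could look like, here is a hedged sketch using the OpenAI Python client. The guideline text, the word limit, and the model name are placeholders, not the prompts or settings evaluated in the paper.

```python
# Illustrative sketch of a two-step prompting strategy (the exact wording, model
# name, and guideline text are placeholders, not the prompts used in the paper).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
GUIDELINES = "Write one concise paragraph covering who spoke, the main issue, and the outcome."

def two_step_summary(dialogue: str, model: str = "gpt-4") -> str:
    # Step 1: draft a summary following the human guidelines.
    draft = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Summarize this dialogue following these guidelines:\n{GUIDELINES}\n\n{dialogue}"}],
    ).choices[0].message.content
    # Step 2: compress the draft to respect a length constraint.
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Shorten this summary to at most 30 words, keeping the key facts:\n{draft}"}],
    ).choices[0].message.content
```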
From speech to facial activity: towards cross-modal sequence-to-sequence attention networks
Multimodal data sources offer the possibility to capture and model interactions between modalities, leading to an improved understanding of underlying relationships. In this regard, the work presented in this paper explores the relationship between facial muscle movements and speech signals. Specifically, we explore the efficacy of different sequence-to-sequence neural network architectures for the task of predicting Facial Action Coding System Action Units (AUs) from one of two acoustic feature representations extracted from speech signals, namely the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) or the Interspeech Computational Paralinguistics Challenge feature set (ComParE). Furthermore, these architectures were enhanced by two different attention mechanisms (intra- and inter-attention) and various state-of-the-art network settings to improve prediction performance. Results indicate that a sequence-to-sequence model with inter-attention can achieve on average an Unweighted Average Recall (UAR) of 65.9 % for AU onset, 67.8 % for AU apex (both eGeMAPS), 79.7 % for AU offset and 65.3 % for AU occurrence (both ComParE) detection over all AUs.
2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP)
DOI: 10.1109/MMSP46350.2019
Funding: BMW Group Research
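The architecture described above can be sketched compactly. The following is a minimal, illustrative PyTorch sequence-to-sequence model with inter- (cross-) attention that maps acoustic feature frames to per-frame AU probabilities; the layer sizes, the feature dimensionality, and the number of AUs are assumptions, not the settings reported in the paper.

```python
# A compact sketch of a sequence-to-sequence model with inter- (cross-) attention
# mapping acoustic feature frames (e.g. eGeMAPS-style descriptors) to per-frame AU
# labels. Dimensions and layer sizes are illustrative, not those of the paper.
import torch
import torch.nn as nn

class Seq2SeqAU(nn.Module):
    def __init__(self, n_feats: int = 88, n_aus: int = 12, hidden: int = 128):
        super().__init__()
        self.encoder = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(embed_dim=hidden, kdim=2 * hidden,
                                          vdim=2 * hidden, num_heads=1, batch_first=True)
        self.out = nn.Linear(hidden, n_aus)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        enc, _ = self.encoder(x)            # (B, T, 2*hidden) encoder states
        dec, _ = self.decoder(enc)          # (B, T, hidden) decoder states
        ctx, _ = self.attn(dec, enc, enc)   # inter-attention: decoder attends to encoder
        return torch.sigmoid(self.out(ctx)) # (B, T, n_aus) AU activation probabilities

model = Seq2SeqAU()
frames = torch.randn(2, 300, 88)            # batch of 300 acoustic frames per sequence
au_probs = model(frames)
```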
AV+EC 2015 – the first affect recognition challenge bridging across audio, video, and physiological data
We present the first Audio-Visual+ Emotion recognition Challenge and workshop (AV+EC 2015) aimed at comparison of multimedia processing and machine learning methods for automatic audio, visual and physiological emotion analysis. This is the 5th event in the AVEC series, but the very first Challenge that bridges across audio, video and physiological data. The goal of the Challenge is to provide a common benchmark test set for multimodal information processing and to bring together the audio, video and physiological emotion recognition communities, to compare the relative merits of the three approaches to emotion recognition under well-defined and strictly comparable conditions and establish to what extent fusion of the approaches is possible and beneficial. This paper presents the challenge, the dataset, and the performance of the baseline system.
The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism
The INTERSPEECH 2013 Computational Paralinguistics Challenge provides for the first time a unified test-bed for Social Signals such as laughter in speech. It further introduces conflict in group discussions as a new task and picks up on autism and its manifestations in speech. Finally, emotion is revisited as a task, albeit with a broader range of twelve emotional states overall. In this paper, we describe these four Sub-Challenges, the Challenge conditions, the baselines, and a new feature set by the openSMILE toolkit, provided to the participants.
\em Bj\"orn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer}\\
{\em Fabien Ringeval, Mohamed Chetouani, Felix Weninger, Florian Eyben, Erik Marchi, }\\
{\em Hugues Salamin, Anna Polychroniou, Fabio Valente, Samuel Kim
AVEC 2016 – Depression, mood, and emotion recognition workshop and challenge
The Audio/Visual Emotion Challenge and Workshop (AVEC 2016) "Depression, Mood and Emotion" will be the sixth competition event aimed at comparison of multimedia processing and machine learning methods for automatic audio, visual and physiological depression and emotion analysis, with all participants competing under strictly the same conditions. The goal of the Challenge is to provide a common benchmark test set for multi-modal information processing and to bring together the depression and emotion recognition communities, as well as the audio, video and physiological processing communities, to compare the relative merits of the various approaches to depression and emotion recognition under well-defined and strictly comparable conditions and establish to what extent fusion of the approaches is possible and beneficial. This paper presents the challenge guidelines, the common data used, and the performance of the baseline system on the two tasks.
Adieu Features? End-to-End Speech Emotion Recognition using a Deep Convolutional Recurrent Network
The automatic recognition of spontaneous emotions from speech is a challenging task. On the one hand, acoustic features need to be robust enough to capture the emotional content for various styles of speaking, while on the other, machine learning algorithms need to be insensitive to outliers while being able to model the context. Whereas the latter has been tackled by the use of Long Short-Term Memory (LSTM) networks, the former is still under very active investigation, even though more than a decade of research has provided a large set of acoustic descriptors. In this paper, we propose a solution to the problem of 'context-aware', emotionally relevant feature extraction by combining Convolutional Neural Networks (CNNs) with LSTM networks, in order to automatically learn the best representation of the speech signal directly from the raw time representation. In this novel work on so-called end-to-end speech emotion recognition, we show that the proposed topology significantly outperforms traditional approaches based on signal processing techniques for the prediction of spontaneous and natural emotions on the RECOLA database.
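A minimal sketch of such an end-to-end CNN + LSTM topology on the raw waveform is given below; the filter sizes, strides, and output targets (e.g. arousal/valence) are illustrative assumptions rather than the exact configuration of the paper.

```python
# Minimal sketch of an end-to-end CNN + LSTM topology operating on the raw
# waveform, in the spirit of the approach described above; the exact filter
# sizes, strides, and output targets are illustrative, not those of the paper.
import torch
import torch.nn as nn

class EndToEndSER(nn.Module):
    def __init__(self, hidden: int = 128, n_outputs: int = 2):  # e.g. arousal/valence
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=16), nn.ReLU(),  # learn filters from raw audio
            nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=5, stride=2), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.lstm = nn.LSTM(128, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_outputs)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        z = self.conv(wav.unsqueeze(1))      # (B, 128, T') local acoustic features
        z = z.transpose(1, 2)                # (B, T', 128) sequence for the LSTM
        out, _ = self.lstm(z)                # temporal context modelling
        return self.head(out[:, -1])         # utterance-level emotion prediction

model = EndToEndSER()
pred = model(torch.randn(4, 16000))          # batch of 1-second raw waveforms at 16 kHz
```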
- …